XEngine: Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments
Memory efficiency is crucial in training deep learning networks on resource-restricted devices. During backpropagation, forward tensors are used to calculate gradients. Instead of keeping all of those dependencies in memory until they are reused in backpropagation, some forward tensors can be discarded and recomputed later from saved tensors, so-called checkpoints. In particular, this allows resource-constrained heterogeneous environments to make use of all available compute devices. Unfortunately, the definition of these checkpoints is a non-trivial problem and poses a challenge to the programmer, because improper or excessive recomputations negate the benefit of checkpointing.
In this article, we present XEngine, an approach that schedules network operators to heterogeneous devices
in low memory environments by determining checkpoints and recomputations of tensors. Our approach
selects suitable resources per timestep and operator and optimizes the end-to-end time for neural networks
taking the memory limitation of each device into account. For this, we formulate a mixed-integer quadratic
program (MIQP) to schedule operators of deep learning networks on heterogeneous systems. We compare
our MIQP solver XEngine against Checkmate [12], a mixed-integer linear programming (MILP) approach
that solves recomputation on a single device. Our solver finds solutions that are up to 22.5% faster than the
fastest Checkmate schedule in which the network is computed exclusively on a single device. We also find
valid schedules for networks making use of both central processing units and graphics processing units if
memory limitations do not allow scheduling exclusively to the graphics processing unit.
Parallel Multi-Hypothesis Algorithm for Criticality Estimation in Traffic and Collision Avoidance
Due to the current developments towards autonomous driving and vehicle active
safety, there is an increasing necessity for algorithms that are able to
perform complex criticality predictions in real-time. Being able to process
multi-object traffic scenarios aids the implementation of a variety of
automotive applications such as driver assistance systems for collision
prevention and mitigation as well as fall-back systems for autonomous vehicles.
We present a fully model-based algorithm with a parallelizable architecture.
The proposed algorithm can evaluate the criticality of complex, multi-modal
(vehicles and pedestrians) traffic scenarios by simulating millions of
trajectory combinations and detecting collisions between objects. The algorithm
is able to estimate upcoming criticality at very early stages, demonstrating
its potential for vehicle safety systems and autonomous driving applications.
A prototype implementation on an embedded system in a test vehicle demonstrates
that the algorithm is compatible with the hardware capabilities of modern cars.
For a complex traffic scenario with 11 dynamic
objects, more than 86 million pose combinations are evaluated in 21 ms on the
GPU of a Drive PX 2.
Acceleration of Multiresolution Imaging Algorithms: A Comparative Study
In this paper, we consider a multiresolution filter and its realization on the Cell BE and GPUs. We present common and architecture-specific optimization strategies for obtaining maximum performance on these architectures, and we show how to obtain speedups of 6.57x and 33.24x compared to an optimized OpenMP baseline implementation. Furthermore, we perform an automated configuration space exploration of different partitioning possibilities to select the best tiling parameters.
FSM-controlled architectures for linear invasion
Invasive computing is a novel concept in multiprocessor architecture and programming. Invasion will become an important step towards the self-organizing behavior needed in the next generation of massively parallel MPSoCs, for which unrivaled performance and resource efficiency are among the main challenges apart from their programming. In this paper, we introduce and model a finite state machine (FSM) for controlling the invasive process at different architectural granularities. The applicability of our FSM is tested in case studies for a reconfigurable MPSoC platform and a fine-grained platform. The results show substantial flexibility gains with only marginal additional hardware cost.